Unicode and Character Sets - Learning C | Vicki Boykis

[Joel on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) C can't represent ![:sweat_smile:](https://a.slack-edge.com/production-standard-emoji-assets/14.0/apple-medium/[email protected]) because it only understands ASCII by default ASCII was the original representation of the alphabet Evolution of character encodings: ASCII ---> Unicode --> UTF-8 (a way to store Unicode) > Back in the semi-olden days, when Unix was being invented and K&R were writing [The C Programming Language](http://cm.bell-labs.com/cm/cs/cbook/), everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called [ASCII](http://www.robelle.com/library/smugbook/ascii.html) which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called _unprintable_ and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in. > In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (), so when Americans would send their résumés to Israel they would arrive as rsums. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.